Paper Note: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

Source type: paper Status: Distilled Date added: 2026-05-03

Bibliography

Why It Matters

This paper is useful for understanding how reinforcement learning for vision-language models can fail on heterogeneous medical video tasks when reward scales are not balanced. It also provides a concrete recipe for building a medical video instruction benchmark from existing expert annotations.

Reading Notes

- Video-level: video summarization (VS), critical view of safety (CVS), next action prediction (NAP), skill assessment.
- Segment-level: temporal action grounding (TAG), dense video captioning, region captioning (RC).
- Frame-level: spatiotemporal grounding (STG).

- Bounding boxes and labels are overlaid on frames for densely annotated surgical datasets.
- Whisper-X transcripts and metadata are used for web-sourced medical videos.
- GPT-4.1 and Gemini-2.5-Flash generate captions independently for validation.

- Medical terminology precision.
- Instrument and anatomy identification.
- Specificity versus vagueness.
- Clinical procedure context.
- Action and state accuracy.

Claims To Distill

Methods And Evidence

- Dataset-task logistic normalization using SFT baseline percentiles.
- Median performance maps to normalized reward 0.5.
- IQR scaling reduces outlier sensitivity.
- Caption tasks combine semantic similarity and a medical LLM judge score.
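The normalization above can be sketched as a logistic curve centered on the SFT baseline's median and scaled by its IQR; the exact functional form and statistics used in the paper may differ, so treat this as a minimal sketch of the stated properties (median maps to 0.5, IQR damps outliers):

```python
import math

def logistic_reward(metric: float, sft_median: float, sft_iqr: float) -> float:
    """Map a raw task metric to a normalized reward in (0, 1).

    Centering on the SFT baseline's median sends median performance
    to exactly 0.5; dividing by the IQR (rather than, say, std dev)
    keeps a few outlier scores from dominating the scale.
    Hypothetical implementation, not the paper's exact formula.
    """
    if sft_iqr <= 0:
        raise ValueError("IQR must be positive")
    z = (metric - sft_median) / sft_iqr
    return 1.0 / (1.0 + math.exp(-z))
```

Because each dataset-task pair uses its own baseline statistics, rewards from heterogeneous tasks land on a comparable 0-to-1 scale before being mixed in the multi-task RL objective.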

- Accuracy for CVS, NAP, and skill assessment.
- mIoU for spatiotemporal grounding and temporal action grounding.
- LLM judge scores and F1 for captioning tasks.
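For the grounding metrics, temporal IoU is the standard interval-overlap measure; a minimal sketch (my own helper, not from the paper) shows how a TAG@0.3 hit would be decided:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two [start, end] time intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def tag_hit(pred: tuple[float, float], gt: tuple[float, float],
            threshold: float = 0.3) -> bool:
    """A prediction counts at TAG@0.3 when its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

mIoU averages `temporal_iou` over the evaluation set; spatiotemporal grounding applies the analogous overlap measure to boxes across frames.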

- Qwen2.5-VL-7B SFT reaches 0.894 CVS, 0.177 STG, 0.142 TAG@0.3, 3.596 VS LLM, and 2.757 RC LLM.
- Qwen2.5-VL-7B MedGRPO improves to 0.896 CVS, 0.202 STG, 0.216 TAG@0.3, 4.184 VS LLM, and 3.442 RC LLM.
- NAP decreases from 0.442 to 0.405 because it was not one of the optimized reward tasks.

Related Work

Follow-ups